A Level-wise Hierarchical Document Clustering method for Categorization
Authors
Abstract
For document categorization, the many words that appear in similar documents are divided into stopwords and keywords, and documents are represented by their keywords alone, with stopwords removed, so that their characteristics are described precisely. To improve clustering precision, this paper proposes SHODC, a seed cluster-based hierarchical document clustering algorithm, and DHODC, a document categorization method that adds domain stopword removal and tree structure expansion. Several experiments found that the deeper the domain level, the more precise the results of the suggested method were compared with other algorithms.
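The abstract does not give the details of SHODC or DHODC, so the following is only a minimal sketch of the general idea of level-wise domain stopword removal, not the paper's algorithm. It assumes that a word counts as a domain stopword when it occurs in at least a given fraction of the documents in that domain; the function names `domain_stopwords` and `strip_stopwords` and the `threshold` parameter are hypothetical.

```python
from collections import Counter


def domain_stopwords(docs, threshold=0.8):
    """Return words that occur in at least `threshold` of the documents.

    Assumption (not from the paper): a word that appears in most documents
    of a domain does not help separate that domain's sub-clusters.
    """
    n = len(docs)
    if n == 0:
        return set()
    # Document frequency: count each word once per document.
    df = Counter(word for doc in docs for word in set(doc))
    return {word for word, count in df.items() if count / n >= threshold}


def strip_stopwords(docs, stopwords):
    """Represent each document by its remaining keywords only."""
    return [[word for word in doc if word not in stopwords] for doc in docs]


# Toy corpus: each document is a list of tokens.
docs = [
    ["data", "clustering", "tree"],
    ["data", "clustering", "seed"],
    ["data", "image", "registration"],
    ["data", "image", "alignment"],
]

# Level 1: "data" occurs in every document, so it is removed for the whole corpus.
level1 = strip_stopwords(docs, domain_stopwords(docs))

# Level 2: inside the first sub-domain, "clustering" becomes a stopword too,
# leaving only the words that distinguish the documents at this depth.
sub_domain = level1[:2]
level2 = strip_stopwords(sub_domain, domain_stopwords(sub_domain))
```

In this toy example a word that is a useful keyword at one level ("clustering") turns into a stopword one level deeper, which is one plausible reading of why deeper domain levels could benefit more from domain-specific stopword removal.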
Similar resources
Document Clustering Based on Ontology and a Fuzzy Approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, the unsupervised classification of similar documents into different groups, is one of the important techniques of text mining. The most important step...
Hierarchical Fuzzy Clustering Semantics (HFCS) in Web Document for Discovering Latent Semantics
This paper discusses the future of World Wide Web development, called the Semantic Web. Undoubtedly, the Web is one of the most important services on the Internet and has had the greatest impact on the spread of the Internet in human societies. Internet penetration has been an effective factor in the growth of the volume of information on the Web. The massive growth of informat...
Web Documents Categorization using Fuzzy Representation and HAC
Most of the existing techniques for characterization of Web documents are based on term-frequency analysis. In such models, given a set of documents, the characterization of each document is represented by a feature vector in a vector space. However, as Web documents written in HTML are semi-structured documents by means of tags, the traditional techniques that assign term weights only by the ...
Hierarchical Bayesian Clustering for Automatic Text Classification
Text classification, the grouping of texts into several clusters, has been used as a means of improving both the efficiency and the effectiveness of text retrieval/categorization. In this paper we propose a hierarchical clustering algorithm that constructs a set of clusters having the maximum Bayesian posterior probability, the probability that the given texts are classified into clusters. We c...
Group-wise registration of large image dataset by hierarchical clustering and alignment
Group-wise registration has been proposed recently for consistent registration of all images in the same dataset. Since all images need to be registered simultaneously with lots of deformation parameters to be optimized, the number of images that the current group-wise registration methods can handle is limited due to the capability of CPU and physical memory in a general computer. To overcome ...